Tesseract-OCR 5.0 LSTM Training Workflow

Download and Installation

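Download the Windows build of Tesseract-OCR.
Download the language data that matches your version from https://github.com/tesseract-ocr/tessdata_best. To recognize Simplified Chinese, download chi_sim.traineddata.
Download the jTessBoxEditor tool.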

Training

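Step 1: Prepare a sufficient number of training images.
Use jTessBoxEditor to merge the images into a single file: launch jTessBoxEditor by running train.bat, then choose Tools -> Merge TIFF from the menu bar.
Generate the box file: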

tesseract yzm_ocr.font.exp0.tif yzm_ocr.font.exp0 --psm 6 batch.nochop makebox

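Command format for Make Box File: tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

Here lang is the language name, fontname is the font name, and num is a sequence number; all three can be chosen freely.

Note: the .tif file and the .box file must share the same base name and sit in the same directory. The TIF file stores the images (it can hold several images, one per page). The BOX file stores where each character is in the image: one line per character, giving its bounding-box coordinates (left, bottom, right, top) and its page number.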
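
The naming rule and the box file layout can be sketched in a few lines of Python. This is only an illustration: the helper names training_basename and parse_box_line are made up for this post, the sample line is invented, and the sketch assumes the usual box line layout of symbol, left, bottom, right, top, page.

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def training_basename(lang: str, fontname: str, num: int) -> str:
    """Build the base name shared by the .tif and .box files."""
    return f"{lang}.{fontname}.exp{num}"


def parse_box_line(line: str) -> BoxEntry:
    """Parse one box line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so the five numeric fields are always isolated.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return BoxEntry(symbol, left, bottom, right, top, page)


# Hypothetical sample line, not taken from a real box file.
entry = parse_box_line("A 12 30 40 62 0")
print(training_basename("yzm_ocr", "font", 0))
print("width:", entry.right - entry.left, "height:", entry.top - entry.bottom)
```

Width and height are not stored directly; they follow from the corner coordinates, as the last line shows.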

Run jTessBoxEditor to correct the recognized characters in the box file.
Use the .tif and .box files to generate the .lstmf file used for LSTM training:

tesseract yzm_ocr.font.exp0.tif yzm_ocr.font.exp0 --psm 6 nobatch lstm.train
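
Because each .tif needs a same-named .box file in the same directory, it may be worth checking the pairs before running the command above on many files. A minimal sketch follows; the function name tifs_missing_box is hypothetical and the check only looks at file names, not at file contents.

```python
from pathlib import Path
from typing import List


def tifs_missing_box(directory: str) -> List[str]:
    """Return the names of .tif files that have no same-named .box file beside them."""
    folder = Path(directory)
    return sorted(
        tif.name
        for tif in folder.glob("*.tif")
        # 'yzm_ocr.font.exp0.tif' is matched against 'yzm_ocr.font.exp0.box'
        if not tif.with_suffix(".box").exists()
    )


if __name__ == "__main__":
    # Check the current directory; an empty list means every .tif has its .box.
    print(tifs_missing_box("."))
```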

Extract the .lstm file from an existing .traineddata:

combine_tessdata -e enm.traineddata enm.lstm

Create the file yzm_ocr.training_files.txt; its content is the path of each .lstmf file, one per line.
Run the training:

lstmtraining --model_output="C:\Users\lixiaojun\Desktop\yyyy\output\output" --continue_from="C:\Users\lixiaojun\Desktop\yyyy\enm.lstm" --train_listfile="C:\Users\lixiaojun\Desktop\yyyy\yzm_ocr.training_files.txt" --traineddata="C:\Users\lixiaojun\Desktop\yyyy\enm.traineddata" --debug_interval -1 --max_iterations 800
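
The list file and the long command line above can also be produced by a small script, which avoids typing the paths by hand. The sketch below is an assumption-laden illustration: the function names write_train_listfile and lstmtraining_args are hypothetical, the directory layout mirrors the one in the command above, and the script only assembles the arguments without running lstmtraining.

```python
from pathlib import Path
from typing import List


def write_train_listfile(lstmf_dir: str, listfile: str) -> List[str]:
    """Write one .lstmf path per line and return the paths that were written."""
    paths = sorted(str(p.resolve()) for p in Path(lstmf_dir).glob("*.lstmf"))
    Path(listfile).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths


def lstmtraining_args(work_dir: str, base_model: str, listfile: str,
                      max_iterations: int = 800) -> List[str]:
    """Assemble the fine-tuning command using the same flags as the command above."""
    work = Path(work_dir)
    return [
        "lstmtraining",
        f"--model_output={work / 'output' / 'output'}",
        f"--continue_from={work / (base_model + '.lstm')}",
        f"--train_listfile={listfile}",
        f"--traineddata={work / (base_model + '.traineddata')}",
        "--debug_interval", "-1",
        "--max_iterations", str(max_iterations),
    ]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(" ".join(lstmtraining_args(".", "enm", "yzm_ocr.training_files.txt")))
```

Keeping the arguments in a list rather than one string sidesteps quoting problems with Windows paths that contain spaces.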

Merge the checkpoint file and the .traineddata file into a new .traineddata file:

lstmtraining --stop_training --continue_from="C:\Users\lixiaojun\Desktop\yyyy\output\output_checkpoint" --traineddata="C:\Users\lixiaojun\Desktop\yyyy\enm.traineddata" --model_output="C:\Users\lixiaojun\Desktop\yyyy\output\yzm_ocr.traineddata"


Run this command to list the available --psm values:

tesseract --help-psm

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.

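In other words:

0 Detect orientation and script only (OSD); no text is recognized.
1 Automatic page segmentation with orientation and script detection.
2 Automatic page segmentation only, without OSD and without OCR (optical character recognition).
3 Fully automatic page segmentation without OSD (the default).
4 Assume a single column of text with varying sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single line of text.
8 Treat the image as a single word.
9 Treat the image as a single word arranged in a circle.
10 Treat the image as a single character.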
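
When tesseract is driven from a script, the mode list above can double as a lookup table for validating the --psm value. The following is a sketch only: tesseract_args is a hypothetical helper, and the table covers just the modes 0 to 10 shown in this post, so a given build may accept values that this check rejects.

```python
from typing import List

# Modes 0-10 as listed in the --help-psm output above.
PSM_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD, or OCR",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Assume a single column of text of variable sizes",
    5: "Assume a single uniform block of vertically aligned text",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    8: "Treat the image as a single word",
    9: "Treat the image as a single word in a circle",
    10: "Treat the image as a single character",
}


def tesseract_args(image: str, output_base: str, psm: int = 3) -> List[str]:
    """Build a tesseract command line, rejecting modes that are not in the table."""
    if psm not in PSM_MODES:
        raise ValueError(f"unknown psm {psm}; choose one of {sorted(PSM_MODES)}")
    return ["tesseract", image, output_base, "--psm", str(psm)]


if __name__ == "__main__":
    # Mode 6 is the one used in the training commands earlier in this post.
    print(tesseract_args("yzm_ocr.font.exp0.tif", "yzm_ocr.font.exp0", psm=6))
```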

I'm XiaoBird
